Debiasing SHAP scores in random forests
Authors
Abstract
Black-box machine learning models are currently used for high-stakes decision making in various parts of society, such as healthcare and criminal justice. While tree-based ensemble methods such as random forests typically outperform deep learning on tabular data sets, their built-in variable-importance algorithms are known to be strongly biased toward high-entropy features. It was recently shown that the increasingly popular SHAP (SHapley Additive exPlanations) values suffer from a similar bias. We propose debiased, or "shrunk", SHAP scores based on sample splitting, which additionally enable the detection of overfitting issues at the feature level.
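The bias described above, and the sample-splitting remedy, can be illustrated with a minimal sketch. This is an assumption-laden analogue, not the authors' algorithm: it uses scikit-learn's permutation importance on a held-out split in place of SHAP values, and all variable names are hypothetical.

```python
# Sketch only: sample splitting to shrink biased importance scores.
# A pure-noise, high-entropy feature receives nonzero in-sample impurity
# importance, while importance measured on a held-out split shrinks toward 0.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)   # informative feature
noise = rng.normal(size=n)    # high-entropy feature unrelated to the target
X = np.column_stack([signal, noise])
y = signal + 0.5 * rng.normal(size=n)

# Sample splitting: fit on one half, evaluate importance on the other.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# In-sample impurity importance assigns the noise feature nonzero weight.
print("impurity importance:", rf.feature_importances_)

# Held-out permutation importance shrinks the noise feature toward zero.
pi = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("held-out importance:", pi.importances_mean)
```

A persistently large gap between the in-sample and held-out scores for a feature is also a per-feature overfitting signal, in the spirit of the feature-level diagnostics the abstract mentions.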
Similar resources
Forests in Random Graphs
Let G be a graph and let I(G) be defined by I(G) = max{|F| : F is an induced forest in G}. Let d = (d1, d2, . . . , dn) be a graphic degree sequence such that d1 ≥ d2 ≥ · · · ≥ dn ≥ 1. By using the probabilistic method, we prove that if G is a graph with degree sequence d, then I(G) ≥ 2 n ∑ ...
Random Forests – Random Features
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The error of a forest of tree classifiers depends on the strength of the individual tre...
Mondrian Forests: Efficient Online Random Forests
Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as ...
Random Forests in Language Modeling
In this paper, we explore the use of Random Forests (RFs) (Amit and Geman, 1997; Breiman, 2001) in language modeling, the problem of predicting the next word based on words already seen before. The goal in this work is to develop a new language modeling approach based on randomly grown Decision Trees (DTs) and apply it to automatic speech recognition. We study our RF approach in the context of ...
Random Composite Forests
We introduce a broad family of decision trees, Composite Trees, whose leaf classifiers are selected out of a hypothesis set composed of p subfamilies with different complexities. We prove new data-dependent learning guarantees for this family in the multi-class setting. These learning bounds provide a quantitative guidance for the choice of the hypotheses at each leaf. Remarkably, they depend o...
Journal
Journal title: AStA Advances in Statistical Analysis
Year: 2023
ISSN: 1863-8171, 1863-818X
DOI: https://doi.org/10.1007/s10182-023-00479-7